This file contains an example of tuning a stacked model with BayesSearchCV.

Load Data

Transformation Pipeline

Model

skopt.BayesSearchCV

https://scikit-optimize.github.io/stable/auto_examples/sklearn-gridsearchcv-replacement.html

https://towardsdatascience.com/xgboost-fine-tune-and-optimize-your-model-23d996fab663

max_depth: 3–10
n_estimators: 100 (lots of observations) to 1000 (few observations)
learning_rate: 0.01–0.3
colsample_bytree: 0.5–1
subsample: 0.6–1

Then focus on optimizing max_depth and n_estimators. After that, try increasing learning_rate to speed up the model. If it becomes faster without losing performance, you can increase the number of estimators to try to improve performance further.
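The starting ranges above could be written as a search space along these lines. This is only a sketch: it assumes the tuple shorthand for dimensions used in the scikit-optimize example linked above, the log-uniform prior on learning_rate is my own choice, and the final_estimator__ prefix assumes the XGBoost model is the final estimator of the stack. with_prefix is a hypothetical helper, not part of any library.

```python
# Starting ranges from the notes above, written as plain tuples.
# Assumption: BayesSearchCV accepts (low, high) and (low, high, prior)
# tuples as shorthand for integer/real dimensions.
xgb_ranges = {
    "max_depth": (3, 10),                         # integer range
    "n_estimators": (100, 1000),                  # integer range
    "learning_rate": (0.01, 0.3, "log-uniform"),  # real; prior is an assumption
    "colsample_bytree": (0.5, 1.0),               # real range
    "subsample": (0.6, 1.0),                      # real range
}


def with_prefix(ranges, prefix="final_estimator__"):
    """Prefix each key so it addresses a parameter of a nested estimator."""
    return {prefix + name: bounds for name, bounds in ranges.items()}


# Keys now look like "final_estimator__max_depth".
search_spaces = with_prefix(xgb_ranges)
```

The resulting dictionary is what would be passed as the search_spaces argument of BayesSearchCV. Narrowing it to max_depth and n_estimators first, then adding learning_rate, follows the order suggested above.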

Find tuning options with:

bayes_search.get_params().keys()

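Note that the search-space key drops the leading estimator__ prefix: use e.g. final_estimator__max_depth, even though bayes_search.get_params().keys() returns estimator__final_estimator__max_depth. The search-space keys are applied to the estimator wrapped by BayesSearchCV, not to the BayesSearchCV object itself.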
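A small helper can do the prefix stripping, so the output of get_params().keys() can be read directly as candidate search-space keys. This is a sketch: to_search_space_keys is a hypothetical name, and example_names is a made-up subset of what bayes_search.get_params().keys() might return.

```python
def to_search_space_keys(param_names, prefix="estimator__"):
    """Return the names usable as search-space keys.

    Keeps only names that start with `prefix` (parameters of the wrapped
    estimator) and strips that prefix from each of them.
    """
    return [name[len(prefix):] for name in param_names if name.startswith(prefix)]


# Hypothetical subset of bayes_search.get_params().keys()
example_names = [
    "cv",
    "estimator",
    "estimator__passthrough",
    "estimator__final_estimator__max_depth",
    "n_iter",
]

print(to_search_space_keys(example_names))
# → ['passthrough', 'final_estimator__max_depth']
```

In the notebook this would be called as to_search_space_keys(bayes_search.get_params().keys()).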

NOTE: I ran into an issue when using passthrough=True (via BayesSearchCV, which isn't itself the problem). It appears to be related to https://github.com/scikit-learn/scikit-learn/issues/16473

I get:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

This is probably caused by the same issue as in the link, which reports the error:

make_column_selector can only be applied to pandas dataframes

The GitHub issue says: "While everything is going [good] if X is only numerical, things start to be complicated when we are dealing with mixed types and dataframe."

Thus I removed the non_numeric transformations and added remainder="drop" so that only the numeric data passes through.

This didn't work, probably because the error occurs before the transformation step. Otherwise the OneHotEncoding would have worked, because all columns are numeric afterward.

Results

Best Scores/Params

BayesSearchCV Performance Over Time


Variable Performance Over Time


Scatter Matrix


Variable Performance - Numeric


Variable Performance - Non-Numeric

No non-numeric Variables


Individual Variable Performance


Regression on roc_auc Mean

Feature Importance

https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

NOTE: foreign worker seems like it should be important, but it is ranked last in feature importance.